We’ll use data from http://www.tfl.gov.uk to analyse usage of the London
Bike Sharing scheme. This data has already been downloaded for you and
saved in a CSV (Comma Separated Values) file, which you will
read into R.
There is no dropdown menu to read your data into R; instead, we
use functions like read_csv to load data from a file into R
objects.
# read the CSV file
bike <- read_csv(here::here("data", "london_bikes.csv")) %>%
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE))

Sometimes our data needs a bit of ‘cleaning’. For instance,
day_of_week is of variable type character, or
chr. We should, however, treat it as a categorical, or
factor, variable and relevel it, so that Monday is the first
level of the factor (i.e., the first day of the week).
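A minimal sketch of that releveling, using base R's factor() (the example vector and the seven three-letter day abbreviations are illustrative assumptions):

```r
# hypothetical example: relevel a character day-of-week vector so that
# Monday becomes the first level of the resulting factor
days <- c("Sat", "Sun", "Mon", "Tue")
day_factor <- factor(days,
                     levels = c("Mon", "Tue", "Wed", "Thu",
                                "Fri", "Sat", "Sun"))
levels(day_factor)[1]  # "Mon"
```

Inside a tidyverse pipeline, forcats::fct_relevel() does the same job within a mutate().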
R is fairly particular about dates. When you read a CSV file, dates
may come in different formats. For instance, Christmas 2017 could be input
as 12-25-2017, 25.12.2017, 25 Dec 2017, Dec 25, 2017, etc. To be
consistent, we use lubridate’s ymd function and force our
date variable to be a date in the format YYYY-MM-DD.
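lubridate’s sibling parsers handle the other field orders; a quick sketch (assuming the lubridate package is installed):

```r
library(lubridate)

# each parser matches a field order; all return the same Date, 2017-12-25
dmy("25.12.2017")    # day-month-year
dmy("25 Dec 2017")   # day-month-year with month name
mdy("Dec 25, 2017")  # month-day-year
mdy("12-25-2017")    # month-day-year, numeric
ymd("2017-12-25")    # already YYYY-MM-DD
```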
Finally, we can turn season from 1, 2, 3, 4, to words
like Winter, Spring, etc.
We will be talking more about data wrangling later, but for now just execute the following piece of code.
# fix dates using lubridate, and generate new variables for year, month, month_name, day, and day_of_week
bike <- bike %>%
  mutate(
    year = year(date),
    month = month(date),
    month_name = month(date, label = TRUE),
    day_of_week = wday(date, label = TRUE))
# generate new variable season_name to turn seasons from numbers to Winter, Spring, etc
bike <- bike %>%
  mutate(
    season_name = case_when(
      month_name %in% c("Dec", "Jan", "Feb") ~ "Winter",
      month_name %in% c("Mar", "Apr", "May") ~ "Spring",
      month_name %in% c("Jun", "Jul", "Aug") ~ "Summer",
      month_name %in% c("Sep", "Oct", "Nov") ~ "Autumn"
    ),
    season_name = factor(season_name,
                         levels = c("Winter", "Spring", "Summer", "Autumn")))
# examine the structure of the dataframe
skim(bike)

| Name | bike |
| Number of rows | 4750 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 3 |
| logical | 1 |
| numeric | 14 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| wday | 0 | 1 | 3 | 3 | 0 | 7 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| month_name | 0 | 1 | TRUE | 12 | Jul: 405, Jan: 403, Mar: 403, May: 403 |
| day_of_week | 0 | 1 | TRUE | 7 | Sun: 679, Mon: 679, Fri: 679, Sat: 679 |
| season_name | 0 | 1 | FALSE | 4 | Sum: 1198, Spr: 1196, Aut: 1183, Win: 1173 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| weekend | 0 | 1 | 0.29 | FAL: 3392, TRU: 1358 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bikes_hired | 0 | 1.00 | 26607.16 | 9744.66 | 0.0 | 19700.75 | 26349.5 | 33623.0 | 73094.0 | ▂▇▆▁▁ |
| year | 0 | 1.00 | 2016.58 | 3.78 | 2010.0 | 2013.00 | 2017.0 | 2020.0 | 2023.0 | ▆▇▆▇▇ |
| month | 0 | 1.00 | 6.52 | 3.45 | 1.0 | 4.00 | 7.0 | 10.0 | 12.0 | ▇▅▅▅▇ |
| week | 0 | 1.00 | 26.58 | 15.05 | 1.0 | 14.00 | 27.0 | 40.0 | 53.0 | ▇▇▇▇▇ |
| cloud_cover | 33 | 0.99 | 4.90 | 2.37 | 0.0 | 3.00 | 5.0 | 7.0 | 9.0 | ▃▅▆▇▃ |
| humidity | 52 | 0.99 | 75.56 | 11.28 | 33.0 | 67.00 | 77.0 | 84.0 | 100.0 | ▁▂▆▇▃ |
| pressure | 31 | 0.99 | 10154.60 | 104.04 | 9731.0 | 10093.00 | 10163.0 | 10225.0 | 10477.0 | ▁▂▇▇▁ |
| radiation | 40 | 0.99 | 119.71 | 89.86 | 2.0 | 41.00 | 96.0 | 188.0 | 402.0 | ▇▅▃▂▁ |
| precipitation | 31 | 0.99 | 17.11 | 37.78 | 0.0 | 0.00 | 0.0 | 16.0 | 516.0 | ▇▁▁▁▁ |
| snow_depth | 302 | 0.94 | 0.02 | 0.24 | 0.0 | 0.00 | 0.0 | 0.0 | 7.0 | ▇▁▁▁▁ |
| sunshine | 31 | 0.99 | 41.53 | 39.15 | 0.0 | 5.00 | 33.0 | 67.0 | 147.0 | ▇▃▂▂▁ |
| mean_temp | 31 | 0.99 | 12.04 | 5.71 | -4.1 | 7.70 | 11.9 | 16.6 | 30.9 | ▁▇▇▅▁ |
| min_temp | 31 | 0.99 | 8.00 | 5.26 | -9.4 | 4.10 | 8.2 | 12.1 | 22.3 | ▁▃▇▇▁ |
| max_temp | 31 | 0.99 | 16.06 | 6.60 | -1.2 | 11.00 | 15.7 | 21.0 | 40.2 | ▂▇▇▂▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2010-07-30 | 2023-07-31 | 2017-01-28 12:00:00 | 4750 |
Besides the number of bikes hired each day, we also have data on the weather for that day.
Having loaded and cleaned our data, we can create summary statistics
using mosaic’s favstats. This is uni-variate analysis, so
in our mosaic syntax we will use
favstats(~ bikes_hired, data = bike). We also want an
overall time-series plot of bikes hired over time; for the latter, we just
create a scatter plot of bikes_hired vs
date.
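A sketch of both steps (assuming the mosaic and ggplot2 packages are loaded, and bike is the data frame created earlier):

```r
library(mosaic)
library(ggplot2)

# univariate summary statistics of daily hires
favstats(~ bikes_hired, data = bike)

# time-series scatter plot of daily hires
ggplot(bike, aes(x = date, y = bikes_hired)) +
  geom_point(alpha = 0.3)
```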
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.97e+04 | 2.63e+04 | 3.36e+04 | 7.31e+04 | 2.66e+04 | 9.74e+03 | 4750 | 0 |
While this analysis shows us overall statistics, what if we wanted
summary statistics by year, day_of_week,
month, or season? Mosaic’s syntax
Y ~ X allows us to facet our analysis of a variable Y by
another variable X:
favstats(Y ~ X, data = ...)
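For example, the by-year table below could come from a call like this (a sketch; swap the right-hand side for day_of_week, month_name, or season_name to get the other tables):

```r
library(mosaic)

# summary statistics of bikes_hired, split by year
favstats(bikes_hired ~ year, data = bike)
```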
| year | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 2010 | 2.76e+03 | 9.3e+03 | 1.4e+04 | 1.87e+04 | 2.8e+04 | 1.41e+04 | 5.62e+03 | 155 | 0 |
| 2011 | 4.56e+03 | 1.63e+04 | 2.03e+04 | 2.37e+04 | 2.94e+04 | 1.96e+04 | 5.5e+03 | 365 | 0 |
| 2012 | 3.53e+03 | 1.93e+04 | 2.62e+04 | 3.25e+04 | 4.71e+04 | 2.6e+04 | 9.43e+03 | 366 | 0 |
| 2013 | 3.73e+03 | 1.76e+04 | 2.2e+04 | 2.74e+04 | 3.56e+04 | 2.2e+04 | 7.28e+03 | 365 | 0 |
| 2014 | 4.33e+03 | 2.05e+04 | 2.77e+04 | 3.44e+04 | 4.9e+04 | 2.75e+04 | 9.07e+03 | 365 | 0 |
| 2015 | 5.78e+03 | 2.21e+04 | 2.66e+04 | 3.29e+04 | 7.31e+04 | 2.7e+04 | 8.55e+03 | 365 | 0 |
| 2016 | 4.89e+03 | 2.24e+04 | 2.79e+04 | 3.51e+04 | 4.66e+04 | 2.82e+04 | 8.85e+03 | 366 | 0 |
| 2017 | 5.14e+03 | 2.41e+04 | 2.95e+04 | 3.45e+04 | 4.6e+04 | 2.86e+04 | 8.38e+03 | 365 | 0 |
| 2018 | 5.86e+03 | 2.18e+04 | 2.92e+04 | 3.77e+04 | 4.61e+04 | 2.9e+04 | 1.02e+04 | 365 | 0 |
| 2019 | 5.65e+03 | 2.42e+04 | 2.89e+04 | 3.45e+04 | 4.47e+04 | 2.86e+04 | 8.09e+03 | 365 | 0 |
| 2020 | 4.87e+03 | 2.01e+04 | 2.75e+04 | 3.65e+04 | 7.02e+04 | 2.85e+04 | 1.16e+04 | 366 | 0 |
| 2021 | 6.25e+03 | 2.17e+04 | 3.1e+04 | 3.82e+04 | 5.69e+04 | 3e+04 | 1.1e+04 | 365 | 0 |
| 2022 | 0 | 2.51e+04 | 3.14e+04 | 4e+04 | 6.7e+04 | 3.15e+04 | 1.03e+04 | 365 | 0 |
| 2023 | 9.26e+03 | 1.99e+04 | 2.33e+04 | 2.76e+04 | 3.53e+04 | 2.35e+04 | 5.64e+03 | 212 | 0 |
| day_of_week | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Sun | 0 | 1.41e+04 | 2.13e+04 | 3.04e+04 | 6.31e+04 | 2.26e+04 | 1.08e+04 | 679 | 0 |
| Mon | 3.97e+03 | 2.05e+04 | 2.59e+04 | 3.21e+04 | 6.7e+04 | 2.63e+04 | 8.53e+03 | 679 | 0 |
| Tue | 3.76e+03 | 2.25e+04 | 2.79e+04 | 3.49e+04 | 6.53e+04 | 2.84e+04 | 8.83e+03 | 678 | 0 |
| Wed | 4.33e+03 | 2.26e+04 | 2.8e+04 | 3.47e+04 | 5.44e+04 | 2.85e+04 | 8.79e+03 | 678 | 0 |
| Thu | 5.65e+03 | 2.26e+04 | 2.77e+04 | 3.52e+04 | 7.31e+04 | 2.85e+04 | 9.01e+03 | 678 | 0 |
| Fri | 5.4e+03 | 2.14e+04 | 2.69e+04 | 3.41e+04 | 6.7e+04 | 2.73e+04 | 8.77e+03 | 679 | 0 |
| Sat | 0 | 1.6e+04 | 2.28e+04 | 3.27e+04 | 7.02e+04 | 2.46e+04 | 1.14e+04 | 679 | 0 |
| month_name | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Jan | 3.73e+03 | 1.39e+04 | 1.9e+04 | 2.34e+04 | 3.8e+04 | 1.87e+04 | 6.09e+03 | 403 | 0 |
| Feb | 3.53e+03 | 1.6e+04 | 2.07e+04 | 2.46e+04 | 5.25e+04 | 2.04e+04 | 6.53e+03 | 367 | 0 |
| Mar | 5.06e+03 | 1.77e+04 | 2.34e+04 | 2.76e+04 | 5.66e+04 | 2.29e+04 | 7.74e+03 | 403 | 0 |
| Apr | 4.87e+03 | 2.11e+04 | 2.62e+04 | 3.13e+04 | 4.9e+04 | 2.62e+04 | 7.78e+03 | 390 | 0 |
| May | 1.2e+04 | 2.46e+04 | 3.04e+04 | 3.65e+04 | 7.02e+04 | 3.07e+04 | 8.56e+03 | 403 | 0 |
| Jun | 6.06e+03 | 2.79e+04 | 3.39e+04 | 3.96e+04 | 6.53e+04 | 3.37e+04 | 8.77e+03 | 390 | 0 |
| Jul | 5.56e+03 | 2.99e+04 | 3.69e+04 | 4.17e+04 | 7.31e+04 | 3.52e+04 | 8.37e+03 | 405 | 0 |
| Aug | 4.3e+03 | 2.7e+04 | 3.43e+04 | 3.93e+04 | 6.7e+04 | 3.21e+04 | 9.95e+03 | 403 | 0 |
| Sep | 0 | 2.5e+04 | 3.23e+04 | 3.67e+04 | 5.18e+04 | 3.08e+04 | 8.47e+03 | 390 | 0 |
| Oct | 7.07e+03 | 2.33e+04 | 2.84e+04 | 3.3e+04 | 4.79e+04 | 2.77e+04 | 6.99e+03 | 403 | 0 |
| Nov | 6.03e+03 | 1.87e+04 | 2.37e+04 | 2.81e+04 | 4.47e+04 | 2.33e+04 | 6.64e+03 | 390 | 0 |
| Dec | 2.76e+03 | 1.17e+04 | 1.67e+04 | 2.31e+04 | 3.91e+04 | 1.72e+04 | 7.03e+03 | 403 | 0 |
| season_name | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Winter | 2.76e+03 | 1.38e+04 | 1.89e+04 | 2.37e+04 | 5.25e+04 | 1.87e+04 | 6.68e+03 | 1173 | 0 |
| Spring | 4.87e+03 | 2.08e+04 | 2.65e+04 | 3.18e+04 | 7.02e+04 | 2.66e+04 | 8.66e+03 | 1196 | 0 |
| Summer | 4.3e+03 | 2.81e+04 | 3.48e+04 | 4e+04 | 7.31e+04 | 3.37e+04 | 9.13e+03 | 1198 | 0 |
| Autumn | 0 | 2.18e+04 | 2.77e+04 | 3.31e+04 | 5.18e+04 | 2.73e+04 | 8.01e+03 | 1183 | 0 |
While summary statistics allow us to quickly discover key metrics that represent the data, they are often not sufficient by themselves. Often, it is useful to represent the data graphically (or visually) to uncover information that is critical for decision making, such as trends, patterns, and outliers.
In this section we will create a time-series scatter-plot that shows
the number of bikes hired on each day. We will use the
ggplot2 library.
Creating a basic plot is a two-step process. The first step is to
specify the data frame and the axes using the syntax
ggplot(data_frame, aes(x = variable, y = variable)). The second
step is to specify the type of plot we want; in this case, a
scatter (point) plot, added with geom_point() after a
+ sign.
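Putting the two steps together, a minimal sketch (the geom_smooth() layer, which adds a trend line, is what triggers the message shown below):

```r
library(ggplot2)

# step 1: data frame and aesthetics; step 2: add the layers
ggplot(bike, aes(x = date, y = bikes_hired)) +
  geom_point(alpha = 0.3) +
  geom_smooth()
```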
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
If you were a consultant or an analyst, would you be comfortable showing the plot you created in the previous section to your client? Think about it. Probably not.
In this section, we will refine our plot to make it presentable (and to communicate the information more effectively).
Without a title, it becomes harder to comprehend the information presented in the plot. In addition, the labels of the X and Y axes are set to the respective variable (column) names by default, and are not formatted in the same way. This is unprofessional, and in some cases, confusing.
We start by using the labs() function to specify the
title, the X-axis label and the Y-axis label. Notice that we are simply
adding another layer to the previous plot by using the +
symbol.
ggplot(bike, aes(x = date, y = bikes_hired)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  theme_bw() +
  labs(
    title = "TfL Daily Bike Rentals",
    x = "Day",
    y = "Number of Bikes Hired"
  ) +
  NULL

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The plot looks much better, but we can further improve its look
using one of ggplot2's built-in themes. The graph below adds
theme_light(); because it is added after theme_bw(), the later theme is the one that takes effect.
ggplot(bike, aes(x = date, y = bikes_hired)) +
  geom_point(alpha = 0.5) +
  geom_smooth(color = "red", size = 1.5) +
  theme_bw() +
  labs(
    title = "TfL Daily Bike Rentals",
    x = "Day",
    y = "Number of Bikes Hired"
  ) +
  theme_light() +
  NULL

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Histogram faceted by season_name
ggplot(bike, aes(x = bikes_hired)) +
  geom_histogram() +
  facet_wrap(~season_name) +
  theme_bw() +
  NULL

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Histogram faceted by month_name
ggplot(bike, aes(x = bikes_hired)) +
  geom_histogram() +
  facet_wrap(~month_name) +
  theme_bw() +
  NULL

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Histogram faceted by month_name in 4 rows
ggplot(bike, aes(x = bikes_hired)) +
  geom_histogram() +
  facet_wrap(~month_name, nrow = 4) +
  theme_bw() +
  NULL

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Density plot filled by season_name
ggplot(bike, aes(x = bikes_hired)) +
  geom_density(aes(fill = season_name), alpha = 0.3) +
  theme_bw() +
  NULL

# Histogram filled by season_name, and faceted by season_name
ggplot(bike, aes(x = bikes_hired)) +
  geom_histogram(aes(fill = season_name), alpha = 0.5) +
  facet_wrap(~season_name, nrow = 4) +
  theme_minimal() +
  # remove the legend on the right
  theme(legend.position = "none") +
  NULL

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Density plot filled by season_name, and faceted by month_name
ggplot(bike, aes(x = bikes_hired)) +
  geom_density(aes(fill = season_name), alpha = 0.3) +
  facet_wrap(~month_name, nrow = 4) +
  theme_bw() +
  theme(legend.position = "none") +
  NULL

# Boxplot of bikes_hired by month
# since 'month' is a number, it treats it as a continuous variable; hence we get just one box
ggplot(bike, aes(x = month, y = bikes_hired)) +
  geom_boxplot() +
  theme_bw() +
  NULL

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
# Boxplot by month_name
ggplot(bike, aes(x = month_name, y = bikes_hired)) +
  geom_boxplot() +
  theme_bw() +
  NULL

# Boxplot by month_name, filled by season_name
ggplot(bike, aes(x = month_name, y = bikes_hired, fill = season_name)) +
  geom_boxplot() +
  theme_bw() +
  NULL

# Violin plot by month_name
ggplot(bike, aes(x = month_name, y = bikes_hired)) +
  geom_violin() +
  theme_bw() +
  NULL

# bikes_hired vs. weather features
ggplot(bike, aes(x = mean_temp, y = bikes_hired)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  NULL

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 31 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 31 rows containing missing values (`geom_point()`).
ggplot(bike, aes(x = mean_temp, y = bikes_hired, colour = season_name)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  # facet_wrap(~season_name, ncol = 1) +
  NULL

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 31 rows containing non-finite values (`stat_smooth()`).
## Removed 31 rows containing missing values (`geom_point()`).
temperature_by_season <- ggplot(bike, aes(x = mean_temp, y = bikes_hired, colour = season_name)) +
  # rather than geom_point(), we use ggiraph's geom_point_interactive() to get tooltips
  geom_point_interactive(aes(
    tooltip = glue::glue("Mean Temp: {mean_temp}\nBikes Hired: {bikes_hired}\nDate: {date}")),
    alpha = 0.3) +
  geom_smooth_interactive(se = FALSE, method = lm) +
  theme_bw() +
  facet_wrap(~season_name, ncol = 1) +
  # facet_grid(season_name ~ weekend) +
  theme(legend.position = "none") +
  NULL

# having created the ggplot object, we now pass it to girafe()
girafe(
  ggobj = temperature_by_season
)

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 31 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 31 rows containing missing values
## (`geom_interactive_point()`).
# bikes_hired vs. humidity
ggplot(bike, aes(x = humidity, y = bikes_hired)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  NULL

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 52 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 52 rows containing missing values (`geom_point()`).
Besides the number of bikes rented out on a given day, we have also
downloaded weather data for London, such as mean_temp,
humidity, pressure, and
precipitation, that measure weather conditions on a
given day. Might it be the case that more bikes are rented out when it’s
warmer? And how can we estimate the effect of rain on rentals?
Your task is to build a regression model that helps you explain the number of rentals per day.
Let us select a few of these numerical variables and create a scatterplot-correlation matrix
bike %>%
select(cloud_cover, humidity, pressure, radiation, precipitation, sunshine, mean_temp, bikes_hired) %>%
GGally::ggpairs()
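We can also ask whether weekday and weekend rentals differ on average. The Welch two-sample t-test output below could be produced by a call along these lines (a sketch):

```r
# compare mean daily hires between weekdays (FALSE) and weekends (TRUE)
t.test(bikes_hired ~ weekend, data = bike)
```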
## Welch Two Sample t-test
##
## data: bikes_hired by weekend
## t = 12.511, df = 2068.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 3575.400 4904.607
## sample estimates:
## mean in group FALSE mean in group TRUE
## 27819.36 23579.36
We start with a naive model, where we just use the average of bikes_hired to predict how many bikes we are going to rent out on a single day.
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.97e+04 | 2.63e+04 | 3.36e+04 | 7.31e+04 | 2.66e+04 | 9.74e+03 | 4750 | 0 |
# can you create a confidence interval for mean bikes_hired? What is the SE?
model0 <- lm(bikes_hired ~ 1, data= bike)
msummary(model0)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26607.2 141.4 188.2 <2e-16 ***
##
## Residual standard error: 9745 on 4749 degrees of freedom
What is the regression’s residual standard error? What is the intercept standard error?
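One way to see the connection (a sketch): for an intercept-only model the intercept is just the sample mean, so its standard error is sd/sqrt(n), and confint() turns it into a 95% confidence interval.

```r
# standard error of the mean: sd / sqrt(n) -- about 141.4, matching the table above
se <- sd(bike$bikes_hired) / sqrt(nrow(bike))

# 95% confidence interval for the mean of bikes_hired,
# read off the intercept-only model fitted above
confint(model0)
```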
Regression of bikes_hired on mean_temp:

# Define the model
model1 <- lm(bikes_hired ~ mean_temp, data = bike)
# look at model estimated
msummary(model1)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13555.40 256.90 52.77 <2e-16 ***
## mean_temp 1084.61 19.29 56.24 <2e-16 ***
##
## Residual standard error: 7558 on 4717 degrees of freedom
## (31 observations deleted due to missingness)
## Multiple R-squared: 0.4014, Adjusted R-squared: 0.4012
## F-statistic: 3163 on 1 and 4717 DF, p-value: < 2.2e-16
Is the effect of mean_temp significant? Why?

Regression of bikes_hired on mean_temp plus weekend:

# Define the model
model2 <- lm(bikes_hired ~ mean_temp + weekend, data = bike)
# look at model estimated
msummary(model2)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14761.58 257.98 57.22 <2e-16 ***
## mean_temp 1083.44 18.68 58.00 <2e-16 ***
## weekendTRUE -4173.41 235.89 -17.69 <2e-16 ***
##
## Residual standard error: 7320 on 4716 degrees of freedom
## (31 observations deleted due to missingness)
## Multiple R-squared: 0.4386, Adjusted R-squared: 0.4384
## F-statistic: 1842 on 2 and 4716 DF, p-value: < 2.2e-16
Fit a regression model called model2 with the
following explanatory variables: mean_temp and
weekend.

Is the effect of mean_temp significant? Why? What is the
meaning of the coefficient on weekendTRUE?
What % of the variability of bikes_hired does your model explain?

Regression of bikes_hired on mean_temp plus wday:

# Define the model
model3 <- lm(bikes_hired ~ mean_temp + wday, data = bike)
# look at model estimated
msummary(model3)

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14268.95 358.68 39.782 < 2e-16 ***
## mean_temp 1082.30 18.56 58.326 < 2e-16 ***
## wdayMon -911.51 395.96 -2.302 0.02138 *
## wdaySat -2703.61 395.95 -6.828 9.69e-12 ***
## wdaySun -4630.43 395.95 -11.694 < 2e-16 ***
## wdayThu 1124.91 395.95 2.841 0.00452 **
## wdayTue 1169.70 395.95 2.954 0.00315 **
## wdayWed 1149.84 395.95 2.904 0.00370 **
##
## Residual standard error: 7271 on 4711 degrees of freedom
## (31 observations deleted due to missingness)
## Multiple R-squared: 0.4466, Adjusted R-squared: 0.4458
## F-statistic: 543.2 on 7 and 4711 DF, p-value: < 2.2e-16
What is the meaning of the effect (slope) of, e.g.,
wdayMon? What % of the variability of bikes_hired does your
model explain?
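One convenient way to read off that percentage (a sketch, assuming the broom package is installed): glance() returns a one-row summary with r.squared, adj.r.squared, and sigma (the residual standard error).

```r
library(broom)

# one-row model summary: r.squared, adj.r.squared, sigma, ...
glance(model3)
```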
Our dataset has many more variables, so here are some ideas on how you can extend your analysis: which
other variables might help explain bikes_hired?

Let us say that your best model is model3. How do the
predictions of your model compare to the actual data?
# plot actual data vs. predicted data
my_best_model <- broom::augment(model3) %>%
  mutate(day = row_number())

# plot actual values (black) and fitted values (red)
ggplot(my_best_model, aes(x = day, y = bikes_hired)) +
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted),
             shape = 1,
             colour = "red", alpha = 0.2) +
  theme_bw()

As you keep building your models, it makes sense to:
- Check your model's residuals, using performance::check_model(model_x). You will always have
some deviation from normality, especially for very high values of bikes_hired.
- Use car::vif(model_x) to calculate the
Variance Inflation Factor (VIF) for your predictors and
determine whether you have collinear variables. A general guideline is
that a VIF larger than 10 is large, and your model may suffer from
collinearity; remove the variable in question and run your model again
without it.

Create a summary table, using huxtable (https://am01-sep23.netlify.app/example/modelling_side_by_side_tables/)
that shows which models you worked on, which predictors are significant,
the adjusted \(R^2\), and the Residual
Standard Error. If you want to add more models, just make sure you do
not forget the comma , after the last model, as shown
below
# produce summary table comparing models using huxtable::huxreg()
huxreg(model0, model1, model2, model3,
       statistics = c('#observations' = 'nobs',
                      'R squared' = 'r.squared',
                      'Adj. R Squared' = 'adj.r.squared',
                      'Residual SE' = 'sigma'),
       bold_signif = 0.05
) %>%
  set_caption('Comparison of models')

|  | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| (Intercept) | 26607.165 *** | 13555.399 *** | 14761.577 *** | 14268.948 *** |
|  | (141.390) | (256.902) | (257.977) | (358.679) |
| mean_temp |  | 1084.607 *** | 1083.442 *** | 1082.295 *** |
|  |  | (19.287) | (18.679) | (18.556) |
| weekendTRUE |  |  | -4173.405 *** |  |
|  |  |  | (235.893) |  |
| wdayMon |  |  |  | -911.506 * |
|  |  |  |  | (395.959) |
| wdaySat |  |  |  | -2703.608 *** |
|  |  |  |  | (395.950) |
| wdaySun |  |  |  | -4630.427 *** |
|  |  |  |  | (395.955) |
| wdayThu |  |  |  | 1124.915 ** |
|  |  |  |  | (395.949) |
| wdayTue |  |  |  | 1169.699 ** |
|  |  |  |  | (395.953) |
| wdayWed |  |  |  | 1149.836 ** |
|  |  |  |  | (395.949) |
| #observations | 4750 | 4719 | 4719 | 4719 |
| R squared | 0.000 | 0.401 | 0.439 | 0.447 |
| Adj. R Squared | 0.000 | 0.401 | 0.438 | 0.446 |
| Residual SE | 9744.660 | 7558.215 | 7320.001 | 7271.345 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |  |  |  |